D. Dubois

Content

Presentation and installation of R

  • programming language widely used in data analysis, statistics and scientific research
  • created in 1996 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand
  • developed by statisticians grouped in the R Development Core Team
  • GNU GPL Licence
  • website: https://www.r-project.org/

ℹ️ R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, in the early 1990s. The development of R was motivated by the desire for an open-source statistical computing and graphics environment that was accessible to researchers and students. The name “R” was partly derived from the first letters of the creators’ first names, Ross and Robert, and also as a play on the name of the S programming language, from which R was conceived as an implementation.

Ross Ihaka, a statistician, and Robert Gentleman, who also has a background in statistics, were both faculty members at the University of Auckland when they began working on R. Their collaboration aimed to create a software environment for their students that was not only free but also flexible enough to perform a wide range of statistical analyses and graphical representations.

The initial release of R to the public occurred in 1995, and it quickly gained popularity in the statistical computing community due to its open-source nature, allowing users to modify, improve, and distribute the software. The Comprehensive R Archive Network (CRAN) was established as a repository of R and its packages, facilitating the contribution and sharing of code by users worldwide.

Strengths

  • free and open source
  • multi-platform
  • extensive package ecosystem
  • interpreted, i.e. code is executed directly, without a compilation step
  • variety of tools for manipulating and transforming data (tidyverse packages)
  • high-quality graphics and visualizations, with a large number of parameters for customizations
  • an active community of users

Weaknesses

  • not easy to learn at first
  • more a Command Line Interface (CLI) than a Graphical User Interface (GUI)

Install

RStudio

  • Integrated Development Environment (IDE) dedicated to R
  • user-friendly interface for R
  • free, open source and multi-platform

3 areas

  • the R interpreter (console) and the terminal
  • the session environment, listing variables/objects with their type and value; also a History tab with the list of past commands
  • the help area, which also lists the installed R packages

The R interpreter

  • interprets the code and displays the results
  • the up and down arrow keys navigate the command history
  • the “Enter” key executes the current line of code
10 * 2
## [1] 20
  • to store a value one must create a variable and assign the value
  • the assignment operator is <-
x <- 10
  • it means that we assign the value “10” to the variable “x”
  • the variable “x” can be called later, to display its value or make calculations
x
## [1] 10

ℹ️ The use of <- as the assignment operator in R is rooted in its history and the influence of the S language.
While <- is the traditional and recommended assignment operator in R, = is also supported for assignment.
However, the use of <- is encouraged in the R community and in coding standards for clarity and tradition.

  • once a variable (or an object) is created, it appears in the “Environment” tab
  • lines of code that have been executed appear in the “History” tab (double-click on a line to run it in the console)
  • one can create as many objects as needed, and make calculations with these objects
y <- 100
x + y
## [1] 110
  • variable / object names can contain letters, numbers and the special characters “.” and “_”
  • but can’t start with a number or a special character
  • no accented characters
  • names are case sensitive, which means that x and X are two different objects
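A quick check of these naming rules (the variable names here are arbitrary):

```r
# case sensitivity: x and X are two distinct objects
x <- 10
X <- 200
x + X
## [1] 210
# "." and "_" are allowed in names (but a name can't start with a digit)
my_value.1 <- 5
```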

Project

  • allows you to organize files, data, and code associated with a specific task (e.g. data analysis)
  • it sets the working directory, i.e. the directory in which files will be stored (and loaded) by default for the project
  • avoids writing the absolute path of files each time a file is read or written
  • to create a new project: File/New Project
  • either create a new folder on the computer or use an existing one

File/New Project

New Project

Scripts

  • the console does not store the code once the session is closed
  • instead of the console, we write the code in a file, called a script, and run its content in the console. This keeps track of the code. To create a new script: File/New File/R Script
  • it is a text file, with an *.R extension
  • to run the code in the console, either click Run or press Ctrl + Enter

Notebooks

  • a notebook is a document that allows the user to mix executable code, visualizations, and explanatory text in a single, interactive document
  • can be exported in html or pdf
  • easy to read, share, reproduce and maintain

File/New File/R Notebook

  • the title can be changed
  • do not modify the output field
  • the text after the three dashes can be deleted; it only explains how notebooks work
  • blocks of code are called chunks
  • Insert/R to insert a new chunk (or Ctrl + Alt + I)
  • inside a chunk, Ctrl + Enter executes the current line of code, and Ctrl + Shift + Enter executes the whole chunk
  • the output of a chunk is displayed just below it
  • as soon as you save the notebook, an html file is created

Markdown

  • text in the notebook can be written with the markdown markup language
  • simple syntax to create headings, emphasis, lists, links, and other basic text formatting
  • simplified html
  • RStudio offers a quick guide: Help/Markdown Quick Reference

Packages

  • set of user-contributed code and functions that extend the functionality of R
  • easy to install with install.packages() or tab Packages and Install
  • developed and maintained by the R community
  • handled and stored in the CRAN network (Comprehensive R Archive Network)
  • write the name of the package and click on Install
  • to be able to use the package it must be loaded with library(package_name)
  • by convention, the library(xxx) should be placed at the top of the script/notebook
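For example, to install and load a package from the console (here questionr, which is used later in this document):

```r
# one-time installation (downloads the package from CRAN)
install.packages("questionr")
# load the package for the current session
library(questionr)
```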

Packages

💻 Practice

  • create a new project
  • create a new notebook called “basics”
  • install and load the packages tidyverse and questionr

ℹ️ The questionr package in R is a useful tool designed primarily for social scientists, providing a set of functions to facilitate the analysis of survey data. It was developed to simplify common tasks associated with survey data analysis, including data manipulation, summarization, and visualization, as well as to provide easy access to descriptive statistics that are particularly relevant in social science research.

Data structures

Types of data

  • Numeric (whole numbers and decimal numbers)
  • Logical (TRUE/FALSE values)
  • Character (textual data)
  • Factor (categorical data)
  • NA : missing value (Not Available)

and also

  • Complex (complex numbers with real and imaginary parts)
  • POSIXct (date and time values)

typeof(object) to display the type of an object

a <- "hello R world !"
typeof(a)
## [1] "character"

The vector

  • it’s a one-dimensional array where elements are of the same data type
  • can be created with the c() function, where c() means concatenate or combine
vec_1 <- c(1, 10, 15)
vec_1
## [1]  1 10 15
(vec_2 <- c("hello", "world"))
## [1] "hello" "world"

💡 Wrapping a line of code in parentheses executes it and displays the output

Mathematical operations are applied element-wise.

height <- c(1.75, 1.54, 1.85, 1.92)
(height_cm <- height * 100)
## [1] 175 154 185 192
weight <- c(72, 59, 110, 95)
(imc <- weight / height ^ 2)
## [1] 23.51020 24.87772 32.14025 25.77040

Other functions to create a vector

  • to create empty vectors: numeric(), character(), logical()
  • an integer argument defines the size of the vector, filled with default values: 0 for numeric, “” for character and FALSE for logical
numeric(10)
##  [1] 0 0 0 0 0 0 0 0 0 0

“:” to create a sequence of consecutive numbers, with a step of 1

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

seq() to create a sequence of numbers with custom start, end and step values

seq(from=0, to=20, by=4)
## [1]  0  4  8 12 16 20

rep() to repeat a vector

rep(1:5, times=2)
##  [1] 1 2 3 4 5 1 2 3 4 5

💡 Notes:

  • seq() and rep() have other arguments, see the help with ?seq and ?rep.
  • There are also functions to generate vectors of random values: rnorm, runif, rbinom, rpois etc.

Indexation system

  • elements can be selected from a vector using square brackets []
  • the first element is indexed as 1
  • to select a single element, provide its index inside the brackets
height <- c(1.75, 1.54, 1.85, 1.92)
height[1]
## [1] 1.75

ℹ️ The decision to start indexing from 1 in R:

Ease for Non-programmers: Many users of R come from academic backgrounds in statistics, mathematics, and various fields of science, where 1-based indexing is the norm. Starting indexes at 1 can be more intuitive for individuals who are not primarily trained in computer science but are using R for data analysis, statistical modelling, and research.

Alignment with Theoretical Concepts: In statistical formulas and mathematical expressions, indexes typically start at 1. Using 1-based indexing in R makes it easier to translate these formulas and expressions directly into code without having to adjust the index values.

to select multiple elements, provide a vector of indices inside the brackets

height[c(1, 3)]
## [1] 1.75 1.85

slicing can be used to select a range of elements

height[2:4]
## [1] 1.54 1.85 1.92
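Two other common forms of selection, shown here as a short sketch: negative indices drop elements, and a logical vector keeps the elements where it is TRUE.

```r
height <- c(1.75, 1.54, 1.85, 1.92)
height[-1]            # negative index: drop the first element
## [1] 1.54 1.85 1.92
height[height > 1.8]  # logical condition: keep matching elements
## [1] 1.85 1.92
```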

The factor

  • a categorical variable (gender, species, political affiliation etc.)
  • created by applying the factor() function to a vector of values
  • R assigns a label to each category and an integer level (1 for the first level, 2 for the second level, and so on)
gender <- c("M", "F", "M", "M", "F", "F")
gender
## [1] "M" "F" "M" "M" "F" "F"
typeof(gender)
## [1] "character"
gender <- factor(gender)
gender
## [1] M F M M F F
## Levels: F M
typeof(gender)
## [1] "integer"
str(gender)
##  Factor w/ 2 levels "F","M": 2 1 2 2 1 1

possible to change the names of the levels with the levels function

levels(gender) <- c("Female", "Male")
str(gender)
##  Factor w/ 2 levels "Female","Male": 2 1 2 2 1 1

Useful Functions

A function is called by its name, followed by parentheses containing zero, one, or multiple arguments.

  • ?function_name (or help("function_name")) to display the function’s help documentation.
  • Some functions have required arguments and optional arguments.
  • In the function signature, required arguments have no default value, while optional arguments do.
  • For example, the mean function has one required argument and an optional argument na.rm, which allows ignoring missing values (NA) when calculating the mean.
  • length(v): returns the length of v (number of elements).
  • mean(v): returns the mean of v.
  • min(v), max(v): return the minimum (maximum) value of v.
  • sum(v): returns the sum of the values in v.
  • range(v): returns a vector containing the minimum and maximum values.
  • unique(v): returns a vector derived from v with duplicate values removed.
mean(height)
## [1] 1.765
min(height)
## [1] 1.54
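The na.rm argument mentioned above can be verified on a small vector containing a missing value (values invented for the example):

```r
v <- c(1.75, 1.54, NA, 1.92)
mean(v)                # the missing value propagates
## [1] NA
mean(v, na.rm = TRUE)  # NA is ignored
## [1] 1.736667
length(v)              # NA still counts as an element
## [1] 4
unique(c(1, 2, 2, 3))  # duplicates removed
## [1] 1 2 3
```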

Combining vectors

With rbind and cbind

  • rbind to combine vectors by row
  • cbind to combine vectors by column
(v1 <- c(1, 2, 3))
## [1] 1 2 3
(v2 <- c(4, 5, 6))
## [1] 4 5 6

rbind

(rbind(v1, v2))
##    [,1] [,2] [,3]
## v1    1    2    3
## v2    4    5    6

cbind

(cbind(v1, v2))
##      v1 v2
## [1,]  1  4
## [2,]  2  5
## [3,]  3  6

Calculating averages on combined vectors

With rowMeans and colMeans

  • rowMeans to calculate the mean by row
  • colMeans to calculate the mean by column
v1 <- seq(1, 10, by=2)
v2 <- seq(2, 20, by=4)
(mat <- cbind(v1, v2))
##      v1 v2
## [1,]  1  2
## [2,]  3  6
## [3,]  5 10
## [4,]  7 14
## [5,]  9 18
rowMeans(mat)
## [1]  1.5  4.5  7.5 10.5 13.5
colMeans(mat)
## v1 v2 
##  5 10

💡 There are also rowSums and colSums for calculating sums.
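A quick sketch of rowSums and colSums on a small matrix built with cbind:

```r
(m <- cbind(v1 = 1:3, v2 = 4:6))
##      v1 v2
## [1,]  1  4
## [2,]  2  5
## [3,]  3  6
rowSums(m)
## [1] 5 7 9
colSums(m)
## v1 v2 
##  6 15
```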

💻 Practice

Exercise 1. With the vectors below, which represent four households:

  1. calculate the absolute difference of income between the husband and the wife in each household (function abs)
  2. calculate the total income of the households
  3. calculate the income per person in each household
husband_income <- c(1200, 1180, 1750, 2100)
wife_income <- c(1450, 1870, 1690, 0)
household_people <- c(4, 2, 3, 2)

Exercise 2. With the vector below:

  1. what is the type of the vector
  2. change the vector as a categorical variable
  3. display the levels and the associated integers
  4. change the level “Ph.D” to “PhD”
education_levels <- c("Bachelor", "Master", "Ph.D")
(education <- sample(education_levels, 10, replace = TRUE))
##  [1] "Ph.D"     "Master"   "Bachelor" "Bachelor" "Ph.D"     "Master"  
##  [7] "Master"   "Master"   "Ph.D"     "Bachelor"

💡 sample() takes a sample of the specified size from the elements of a vector, with or without replacement.

Exercise 3. With the vectors below, which represent the grades of 6 students in 3 disciplines:

  1. calculate the average of each student
  2. calculate the average in each discipline
  3. display the quartiles of the python discipline
python <- c(12, 16, 8, 18, 6, 10)
html <- c(14, 9, 13, 15, 17, 11)
bdd <- c(18, 11, 14, 10, 8, 12)

The dataframe

  • two-dimensional array (with rows and columns), similar to a spreadsheet
  • each column / variable is a vector
  • vectors can be of different data type (number, character, logical, factor)
  • created with the data.frame() function
df1 <- data.frame(
  gender=factor(sample(c("M", "F"), 10, replace=T)),
  age=sample(18:100, 10, replace=T),
  education=education
  )
df1
##    gender age education
## 1       F  53      Ph.D
## 2       F  55    Master
## 3       M  79  Bachelor
## 4       M  80  Bachelor
## 5       F  70      Ph.D
## 6       M  80    Master
## 7       M  50    Master
## 8       M  99    Master
## 9       F  43      Ph.D
## 10      M  66  Bachelor

We can access a variable/column/vector in a dataframe with $

df1$gender
##  [1] F F M M F M M M F M
## Levels: F M
df1$gender[c(1, 3)]
## [1] F M
## Levels: F M

Let’s take an example dataset, hdv2003, provided by the questionr package that was previously loaded. We load the dataset with the following command:

data("hdv2003")

This places the dataset into the working environment. hdv2003 is a subset of the Histoire de vie survey conducted by INSEE in 2003. It contains 2,000 individuals and 20 variables.

Click on the object in the Environment tab to view the data table (you can also use the function View(hdv2003)).

💡 data() written in the console displays all available datasets (R core + loaded libraries).

Useful functions to discover a dataframe

  • names(df): displays the column names
  • nrow(df): returns the number of rows
  • ncol(df): returns the number of columns
  • dim(df): returns a vector with the dataset dimensions, i.e., nrow and ncol
  • str(df): displays information about the dataset – each variable with its type and the first few values (if you click on the blue arrow in the environment, it shows the content of str(df))
  • head(df): displays the first rows
  • tail(df): displays the last rows
dim(hdv2003)
## [1] 2000   20
names(hdv2003)
##  [1] "id"            "age"           "sexe"          "nivetud"      
##  [5] "poids"         "occup"         "qualif"        "freres.soeurs"
##  [9] "clso"          "relig"         "trav.imp"      "trav.satisf"  
## [13] "hard.rock"     "lecture.bd"    "peche.chasse"  "cuisine"      
## [17] "bricol"        "cinema"        "sport"         "heures.tv"

💻 Practice

Load the dataset hdv2003 from package questionr

  1. display the dimensions of the dataset
  2. display the names of the variables
  3. display the type of the variables

The matrix

  • a two-dimensional array consisting of elements of the same data type
  • elements of a matrix are organized in rows and columns

Creation:

  • matrix(data, nrow, ncol, byrow=FALSE, dimnames=NULL)
  • diag(n) : diagonal matrix with n ones on the diagonal
  • diag(v) : diagonal matrix with the vector v on the diagonal
# 3x3 matrix with values from 1 to 9
matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Matrix operations

  • %*% : matrix product
  • t(mat) : matrix transpose
  • solve(mat) : matrix inverse
  • det(mat) : matrix determinant
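A minimal sketch of these operations on an invented 2×2 matrix:

```r
(m <- matrix(c(2, 0, 1, 3), nrow = 2))
##      [,1] [,2]
## [1,]    2    1
## [2,]    0    3
t(m)               # transpose
det(m)             # 2*3 - 1*0 = 6
## [1] 6
m_inv <- solve(m)  # inverse
round(m %*% m_inv) # product with the inverse: identity matrix
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
```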


The array

  • a multi-dimensional data structure that can store data in more than two dimensions
  • similar to matrices but are generalized to an arbitrary number of dimensions
  • array() function to create arrays.
  • like matrices, all elements in an array must be of the same data type.
(my_array <- array(1:18, dim = c(3, 3, 2)))
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   10   13   16
## [2,]   11   14   17
## [3,]   12   15   18

The list

  • can contain different types of data / objects
  • created using the list() function and elements are separated by commas
  • similar to a dictionary in other languages (with pairs of key / value)
mylist <- list("my_vector"=1:10, "my_sentence"="Hello world!")
mylist
## $my_vector
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $my_sentence
## [1] "Hello world!"

an element can be accessed by its name with $

mylist$my_vector
##  [1]  1  2  3  4  5  6  7  8  9 10

Dataset preparation

The tidyverse

A set of packages that facilitate dataset management (data cleaning, data wrangling, preprocessing)

library(tidyverse)

readr (import of data), tibble (new dataframe), forcats (qualitative variables), stringr (for characters), tidyr (data formatting), dplyr (data manipulation), ggplot2 (visualization), purrr (programming)

The Tibble

  • modern version of the data frame
  • compared to the dataframe, the tibble
    • has no rownames
    • allows column names with spaces, special characters, numbers etc.
    • is displayed optimally in the console, over a few lines and with some information about each variable
  • functions and libraries of the tidyverse accept a dataframe as argument but return a tibble
as_tibble(hdv2003)
## # A tibble: 2,000 × 20
##       id   age sexe  nivetud        poids occup qualif freres.soeurs clso  relig
##    <int> <int> <fct> <fct>          <dbl> <fct> <fct>          <int> <fct> <fct>
##  1     1    28 Femme Enseignement…  2634. Exer… Emplo…             8 Oui   Ni c…
##  2     2    23 Femme <NA>           9738. Etud… <NA>               2 Oui   Ni c…
##  3     3    59 Homme Derniere ann…  3994. Exer… Techn…             2 Non   Ni c…
##  4     4    34 Homme Enseignement…  5732. Exer… Techn…             1 Non   Appa…
##  5     5    71 Femme Derniere ann…  4329. Retr… Emplo…             0 Oui   Prat…
##  6     6    35 Femme Enseignement…  8675. Exer… Emplo…             5 Non   Ni c…
##  7     7    60 Femme Derniere ann…  6166. Au f… Ouvri…             1 Oui   Appa…
##  8     8    47 Homme Enseignement… 12892. Exer… Ouvri…             5 Non   Ni c…
##  9     9    20 Femme <NA>           7809. Etud… <NA>               4 Oui   Appa…
## 10    10    28 Homme Enseignement…  2277. Exer… Autre              2 Non   Prat…
## # ℹ 1,990 more rows
## # ℹ 10 more variables: trav.imp <fct>, trav.satisf <fct>, hard.rock <fct>,
## #   lecture.bd <fct>, peche.chasse <fct>, cuisine <fct>, bricol <fct>,
## #   cinema <fct>, sport <fct>, heures.tv <dbl>

Loading a dataset

  • the csv format is a de facto standard for sharing data
  • 2 functions: read_csv and read_csv2
  • read_csv: “,” as separator and “.” as decimal
  • read_csv2: “;” as separator and “,” as decimal
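A round-trip sketch using a temporary file and the built-in iris dataset (assumes readr is loaded, e.g. via the tidyverse):

```r
library(readr)
tmp <- tempfile(fileext = ".csv")
write_csv(head(iris), tmp)  # export the first rows of a built-in dataset
df <- read_csv(tmp)         # "," as separator, "." as decimal
dim(df)
## [1] 6 5
```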

💡 Notes:

  • There exists also read_excel, read_stata, read_sas etc. to import data from other formats.
  • RStudio provides a graphical interface to import data, by clicking on “Import Dataset” in the environment tab

Exporting data

To export a dataset in a csv file, there exists write_csv and write_csv2.

Serialization

  • Used to save R objects in R format
  • Allows saving multiple R objects in a single file (extension *.Rdata)
  • Function save(objects, file="file.Rdata")
  • Function load("file.Rdata") to reload the objects into the R session environment
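A minimal sketch using a temporary file:

```r
x <- 1:10
msg <- "hello R"
f <- tempfile(fileext = ".Rdata")
save(x, msg, file = f)  # save several objects in one file
rm(x, msg)              # remove them from the session environment
load(f)                 # x and msg are restored
x
##  [1]  1  2  3  4  5  6  7  8  9 10
```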

Recoding categorical variables

With forcats

Rename the levels

fct_recode to rename the levels

freq(hdv2003$qualif)
##                            n    % val%
## Ouvrier specialise       203 10.2 12.3
## Ouvrier qualifie         292 14.6 17.7
## Technicien                86  4.3  5.2
## Profession intermediaire 160  8.0  9.7
## Cadre                    260 13.0 15.7
## Employe                  594 29.7 35.9
## Autre                     58  2.9  3.5
## NA                       347 17.3   NA

💡 The freq function from the questionr package provides information on a categorical variable: counts, percentages (with and without NAs). (?questionr::freq).

💡 The categories of a categorical variable are called levels.

levels(hdv2003$qualif)
## [1] "Ouvrier specialise"       "Ouvrier qualifie"        
## [3] "Technicien"               "Profession intermediaire"
## [5] "Cadre"                    "Employe"                 
## [7] "Autre"

fct_recode(factor, "new name" = "old name", ...)

hdv2003$qualif_grouped <- fct_recode(
  hdv2003$qualif,
  "Ouvrier_spe" = "Ouvrier specialise",
  "Ouvrier_qual" = "Ouvrier qualifie",
  "Interm" = "Profession intermediaire"
)
freq(hdv2003$qualif_grouped)
##                n    % val%
## Ouvrier_spe  203 10.2 12.3
## Ouvrier_qual 292 14.6 17.7
## Technicien    86  4.3  5.2
## Interm       160  8.0  9.7
## Cadre        260 13.0 15.7
## Employe      594 29.7 35.9
## Autre         58  2.9  3.5
## NA           347 17.3   NA

to recode a level as a missing value (NA), set NULL as the new name

hdv2003$qualif_grouped <- fct_recode(hdv2003$qualif_grouped, NULL="Autre")
freq(hdv2003$qualif_grouped)
##                n    % val%
## Ouvrier_spe  203 10.2 12.7
## Ouvrier_qual 292 14.6 18.3
## Technicien    86  4.3  5.4
## Interm       160  8.0 10.0
## Cadre        260 13.0 16.3
## Employe      594 29.7 37.2
## NA           405 20.2   NA

To turn the NA values into an explicit factor level, use fct_na_value_to_level

levels(hdv2003$qualif_grouped)
## [1] "Ouvrier_spe"  "Ouvrier_qual" "Technicien"   "Interm"       "Cadre"       
## [6] "Employe"
hdv2003$qualif_grouped <- fct_na_value_to_level(hdv2003$qualif_grouped)
levels(hdv2003$qualif_grouped)
## [1] "Ouvrier_spe"  "Ouvrier_qual" "Technicien"   "Interm"       "Cadre"       
## [6] "Employe"      NA

Group some levels

fct_collapse

hdv2003$qualif_rec <- fct_collapse(
  hdv2003$qualif, 
  "Ouvrier"=c("Ouvrier specialise", "Ouvrier qualifie"), 
  "Interm"=c("Technicien", "Profession intermediaire"))
freq(hdv2003$qualif_rec)
##           n    % val%
## Ouvrier 495 24.8 29.9
## Interm  246 12.3 14.9
## Cadre   260 13.0 15.7
## Employe 594 29.7 35.9
## Autre    58  2.9  3.5
## NA      347 17.3   NA

💡 Notes:

  1. fct_other() to group a set of categories into the “Other” category.
  2. fct_lump() to group the least frequent categories into the “Other” category.
  3. Graphical interface provided by questionr, accessible via Addins / Levels recoding.
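A toy sketch of fct_other() and fct_lump(), with invented factor values:

```r
library(forcats)
f <- factor(c("a", "a", "a", "b", "b", "c", "d"))
fct_other(f, keep = c("a", "b"))  # everything except "a" and "b" -> "Other"
## [1] a     a     a     b     b     Other Other
## Levels: a b Other
fct_lump(f, n = 2)                # keep the 2 most frequent levels
## [1] a     a     a     b     b     Other Other
## Levels: a b Other
```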

Reorder the levels

hdv2003$qualif_rec <- fct_relevel(
    hdv2003$qualif,
    "Cadre", "Profession intermediaire", "Technicien", 
    "Employe", "Ouvrier qualifie", "Ouvrier specialise",
    "Autre"
)
freq(hdv2003$qualif_rec)
##                            n    % val%
## Cadre                    260 13.0 15.7
## Profession intermediaire 160  8.0  9.7
## Technicien                86  4.3  5.2
## Employe                  594 29.7 35.9
## Ouvrier qualifie         292 14.6 17.7
## Ouvrier specialise       203 10.2 12.3
## Autre                     58  2.9  3.5
## NA                       347 17.3   NA
str(hdv2003$qualif_rec)
##  Factor w/ 7 levels "Cadre","Profession intermediaire",..: 4 NA 3 3 4 4 5 5 NA 7 ...

💻 Practice

  1. with the hdv2003 dataset, recode the variable relig in order to get
                              n    % val%
Pratiquant                  708 35.4 35.4
Appartenance                760 38.0 38.0
Ni croyance ni appartenance 399 20.0 20.0
Rejet                        93  4.7  4.7
NSP                          40  2.0  2.0
  2. recode the variable nivetud to get
                                          n    % val%
N'a jamais fait d'etudes                 39  2.0  2.1
Études primaires                        427 21.3 22.6
1er cycle                               204 10.2 10.8
2eme cycle                              183  9.2  9.7
Enseignement technique ou professionnel 594 29.7 31.5
Enseignement superieur                  441 22.0 23.4
NA                                      112  5.6   NA
  3. reorder the factors of this variable to get
                                         n    % val%
Enseignement superieur                  441 22.0 23.4
Enseignement technique ou professionnel 594 29.7 31.5
2eme cycle                              183  9.2  9.7
1er cycle                               204 10.2 10.8
Études primaires                        427 21.3 22.6
N'a jamais fait d'etudes                 39  2.0  2.1
NA                                      112  5.6   NA

Create a categorical variable

from a numeric variable

With cut function

  • Creating classes from a numerical variable
  • Example: an income variable
  • The breaks argument allows defining class intervals

breaks=n: R creates n classes of equal width

hdv2003$age_classe <- cut(hdv2003$age, breaks=5)
freq(hdv2003$age_classe)
##               n    % val%
## (17.9,33.8] 454 22.7 22.7
## (33.8,49.6] 628 31.4 31.4
## (49.6,65.4] 556 27.8 27.8
## (65.4,81.2] 319 16.0 16.0
## (81.2,97.1]  43  2.1  2.1

breaks=vector: R creates classes whose boundaries are the values in the vector

hdv2003$age_classe2 <- cut(
    hdv2003$age, 
    breaks = c(18, 25, 35, 45, 55, 65, 100), 
    include.lowest = T
)
freq(hdv2003$age_classe2)
##            n    % val%
## [18,25]  191  9.6  9.6
## (25,35]  338 16.9 16.9
## (35,45]  390 19.5 19.5
## (45,55]  414 20.7 20.7
## (55,65]  305 15.2 15.2
## (65,100] 362 18.1 18.1

💡 package questionr provides a graphical interface for cut: Addins/Numeric range dividing

if_else

  • for a binary situation
  • if_else takes 3 arguments: the test, the value if the test is TRUE, and the value if it is FALSE
v <- c(12, 14, 8, 16, 4)
if_else(v > 10, "greater than 10", "lower than (or equal to) 10")
## [1] "greater than 10"             "greater than 10"            
## [3] "lower than (or equal to) 10" "greater than 10"            
## [5] "lower than (or equal to) 10"

the test can be built with several conditions

hdv2003$statut <- if_else(
    hdv2003$sexe == "Homme" & hdv2003$age > 60,
    "Men over 60",
    "Other"
)
freq(hdv2003$statut)
##                n    % val%
## Men over 60  222 11.1 11.1
## Other       1778 88.9 88.9

case_when

an extension of if_else when several conditions are needed.
Each condition is written as condition ~ value.
⚠️ the conditions are tested sequentially, so it is necessary to start with the most specific condition

hdv2003$statut <- case_when(
    hdv2003$age > 60 & hdv2003$sexe == "Homme" ~ "Man over 60",
    hdv2003$age > 50 & hdv2003$sexe == "Femme" ~ "Woman over 50",
    TRUE ~ "Other"
)
freq(hdv2003$statut)
##                  n    % val%
## Man over 60    222 11.1 11.1
## Other         1312 65.6 65.6
## Woman over 50  466 23.3 23.3

As TRUE is always TRUE, all the other values will be coded as “Other”.

💻 Practice

  1. cut the variable heures.tv to get the following frequency table
       n   %    val%
[0,1]  684 34.2 34.3
(1,2]  535 26.8 26.8
(2,4]  594 29.7 29.8
(4,6]  138  6.9  6.9
(6,12]  44  2.2  2.2
NA       5  0.2   NA
  2. with if_else, create the variable cinema_bd which groups individuals who go to the cinema and read BD (comics); the others are put in the category “Autre”. The result should be similar to
               n    % val%
Autre        1971 98.6 98.6
Cinéma et BD   29  1.5  1.5
  3. with case_when, create the variable gender_bs with the following categories: Man with more than 2 brothers and sisters, Woman with more than 2 brothers and sisters, Other. The expected result is
                                         n    % val%
Other                                  1001 50.0 50.0
Woman with more than 2 B&S             546  27.3 27.3
Man with more than 2 B&S               453  22.7 22.7

Data formatting

to manipulate the data, it has to be well organized, “tidy”:

  • each variable is in a unique column
  • each column contains only one variable
  • each row corresponds to one observation for each variable
  • cells represent the values of each observation for each variable

tidy

example of a non-tidy dataset: population of 3 countries over 2 years

Data non tidy

for this dataset to be “tidy”, we need

  • one observation per row (here 3 countries × 2 years = 6 rows)
  • a variable “year” and a variable “population”

The dataset should be of the form

Data tidy

pivot_longer

Function pivot_longer to create a longer dataset, i.e. with more rows

##   country     2002     2007
## 1 Belgium 10311970 10392226
## 2  France 59925035 61083916
## 3 Germany 82350671 82400996

the columns 2002 and 2007 should go into a column “annee” (year), and the values into a column “population”

pivot_longer(data=tidy_ex, cols=c("2002", "2007"), names_to="annee", values_to="population")
## # A tibble: 6 × 3
##   country annee population
##   <chr>   <chr>      <dbl>
## 1 Belgium 2002    10311970
## 2 Belgium 2007    10392226
## 3 France  2002    59925035
## 4 France  2007    61083916
## 5 Germany 2002    82350671
## 6 Germany 2007    82400996

pivot_wider

Function pivot_wider to create a wider dataset, i.e. with more columns

##   country continent year variable        value
## 1 Belgium    Europe 2002  lifeExp       78.320
## 2 Belgium    Europe 2002      pop 10311970.000
## 3 Belgium    Europe 2007  lifeExp       79.441
## 4 Belgium    Europe 2007      pop 10392226.000
## 5  France    Europe 2002  lifeExp       79.590
## 6  France    Europe 2002      pop 59925035.000
## 7  France    Europe 2007  lifeExp       80.657
## 8  France    Europe 2007      pop 61083916.000

the column “variable” should be split into two columns, lifeExp and pop

pivot_wider(data=tidy_ex1, names_from=variable, values_from=value)
## # A tibble: 4 × 5
##   country continent  year lifeExp      pop
##   <chr>   <chr>     <dbl>   <dbl>    <dbl>
## 1 Belgium Europe     2002    78.3 10311970
## 2 Belgium Europe     2007    79.4 10392226
## 3 France  Europe     2002    79.6 59925035
## 4 France  Europe     2007    80.7 61083916

💡 Notes:

  • separate and unite functions to split and combine columns
  • separate_rows to split a column into several rows
  • complete function to fill in missing combinations of values
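A small sketch of separate() and unite() on an invented tibble:

```r
library(tidyr)
df <- tibble::tibble(date = c("2023-01", "2023-02"))
# split "date" into two character columns
df2 <- separate(df, date, into = c("year", "month"), sep = "-")
df2$year
## [1] "2023" "2023"
# and back: combine the two columns into one
df3 <- unite(df2, "date", year, month, sep = "-")
df3$date
## [1] "2023-01" "2023-02"
```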

Dataset Manipulation

dplyr provides functions that facilitate data manipulation

For the examples, we use the datasets from the package nycflights23.
This is the flight data of the 3 New York airports in 2023.
It is necessary to install the package nycflights23 and then load the three datasets

data("flights")
data("airports")
data("airlines")

slice

To select rows based on their index in the dataset

slice(airports, 2:8)
## # A tibble: 7 × 8
##   faa   name                                  lat    lon   alt    tz dst   tzone
##   <chr> <chr>                               <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
## 1 AAP   Andrau Airpark                       29.7  -95.6    79    -6 A     Amer…
## 2 ABE   Lehigh Valley International Airport  40.7  -75.4   393    -5 A     Amer…
## 3 ABI   Abilene Regional Airport             32.4  -99.7  1791    -6 A     Amer…
## 4 ABL   Ambler Airport                       67.1 -158.    334    -9 A     Amer…
## 5 ABQ   Albuquerque International Sunport    35.0 -107.   5355    -7 A     Amer…
## 6 ABR   Aberdeen Regional Airport            45.4  -98.4  1302    -6 A     Amer…
## 7 ABY   Southwest Georgia Regional Airport   31.5  -84.2   197    -5 A     Amer…

filter

To select rows in the dataset based on conditions.

Example: all the flights of January

filter(flights, month==1)
## # A tibble: 36,020 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2023     1     1        1           2038       203      328              3
##  2  2023     1     1       18           2300        78      228            135
##  3  2023     1     1       31           2344        47      500            426
##  4  2023     1     1       33           2140       173      238           2352
##  5  2023     1     1       36           2048       228      223           2252
##  6  2023     1     1      503            500         3      808            815
##  7  2023     1     1      520            510        10      948            949
##  8  2023     1     1      524            530        -6      645            710
##  9  2023     1     1      537            520        17      926            818
## 10  2023     1     1      547            545         2      845            852
## # ℹ 36,010 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Several conditions

Example: all the flights with departure delay between 10 and 15 minutes

filter(flights, dep_delay >= 10 & dep_delay <= 15)
## # A tibble: 17,829 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2023     1     1      520            510        10      948            949
##  2  2023     1     1      613            600        13      834            836
##  3  2023     1     1      810            800        10     1153           1147
##  4  2023     1     1      811            800        11     1122           1136
##  5  2023     1     1      814            800        14     1048           1058
##  6  2023     1     1      833            820        13     1436           1435
##  7  2023     1     1      840            828        12     1136           1140
##  8  2023     1     1      858            845        13     1042           1045
##  9  2023     1     1      859            844        15     1207           1209
## 10  2023     1     1     1005            950        15     1226           1235
## # ℹ 17,819 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

💡 filter can be used with & (AND) and | (OR) operators.
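For example, | keeps the rows matching either condition (a sketch on the same flights dataset):

```r
library(dplyr)
library(nycflights23)
data("flights")

# Flights that departed in January or in December
filter(flights, month == 1 | month == 12)

# The %in% operator is a convenient equivalent
filter(flights, month %in% c(1, 12))
```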

It is possible to use the result of a function as a condition

Example: flights with the longest distance

filter(flights, distance == max(distance))
## # A tibble: 667 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2023     1     1      949            900        49       NA           1525
##  2  2023     1     1     1023           1000        23     1637           1610
##  3  2023     1     2      919            900        19     1543           1525
##  4  2023     1     2      951           1000        -9     1620           1610
##  5  2023     1     3      922            900        22     1535           1525
##  6  2023     1     3     1007           1000         7     1630           1610
##  7  2023     1     4      912            900        12     1511           1525
##  8  2023     1     4     1001           1000         1     1630           1610
##  9  2023     1     5      854            900        -6     1454           1525
## 10  2023     1     5      949           1000       -11     1600           1610
## # ℹ 657 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

select

To select some columns of the dataset

Example: latitude and longitude

select(airports, lat, lon)
## # A tibble: 1,251 × 2
##      lat    lon
##    <dbl>  <dbl>
##  1  29.7  -85.0
##  2  29.7  -95.6
##  3  40.7  -75.4
##  4  32.4  -99.7
##  5  67.1 -158. 
##  6  35.0 -107. 
##  7  45.4  -98.4
##  8  31.5  -84.2
##  9  41.3  -70.1
## 10  31.6  -97.2
## # ℹ 1,241 more rows

if we add “-” before the name of a column, all the columns are displayed except the ones listed

select(airports, -lat)
## # A tibble: 1,251 × 7
##    faa   name                                   lon   alt    tz dst   tzone     
##    <chr> <chr>                                <dbl> <dbl> <dbl> <chr> <chr>     
##  1 AAF   Apalachicola Regional Airport        -85.0    20    -5 A     America/N…
##  2 AAP   Andrau Airpark                       -95.6    79    -6 A     America/C…
##  3 ABE   Lehigh Valley International Airport  -75.4   393    -5 A     America/N…
##  4 ABI   Abilene Regional Airport             -99.7  1791    -6 A     America/C…
##  5 ABL   Ambler Airport                      -158.    334    -9 A     America/A…
##  6 ABQ   Albuquerque International Sunport   -107.   5355    -7 A     America/D…
##  7 ABR   Aberdeen Regional Airport            -98.4  1302    -6 A     America/C…
##  8 ABY   Southwest Georgia Regional Airport   -84.2   197    -5 A     America/N…
##  9 ACK   Nantucket Memorial Airport           -70.1    47    -5 A     America/N…
## 10 ACT   Waco Regional Airport                -97.2   516    -6 A     America/C…
## # ℹ 1,241 more rows

selection helpers can be used to select the columns matching a condition

select(flights, starts_with("dep_"))
## # A tibble: 435,352 × 2
##    dep_time dep_delay
##       <int>     <dbl>
##  1        1       203
##  2       18        78
##  3       31        47
##  4       33       173
##  5       36       228
##  6      503         3
##  7      520        10
##  8      524        -6
##  9      537        17
## 10      547         2
## # ℹ 435,342 more rows

the helpers ends_with, contains and matches also exist
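For instance, assuming the flights dataset is loaded, the other helpers work the same way:

```r
library(dplyr)
library(nycflights23)
data("flights")

# Columns whose name ends with "_delay"
select(flights, ends_with("_delay"))

# Columns whose name contains "time"
select(flights, contains("time"))
```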

select can rename the columns ‘on the fly’

select(airports, latitude=lat, longitude=lon)
## # A tibble: 1,251 × 2
##    latitude longitude
##       <dbl>     <dbl>
##  1     29.7     -85.0
##  2     29.7     -95.6
##  3     40.7     -75.4
##  4     32.4     -99.7
##  5     67.1    -158. 
##  6     35.0    -107. 
##  7     45.4     -98.4
##  8     31.5     -84.2
##  9     41.3     -70.1
## 10     31.6     -97.2
## # ℹ 1,241 more rows

rename

to rename columns, use the rename function

rename(airports, longitude=lon, latitude=lat, altitude=alt)
## # A tibble: 1,251 × 8
##    faa   name                      latitude longitude altitude    tz dst   tzone
##    <chr> <chr>                        <dbl>     <dbl>    <dbl> <dbl> <chr> <chr>
##  1 AAF   Apalachicola Regional Ai…     29.7     -85.0       20    -5 A     Amer…
##  2 AAP   Andrau Airpark                29.7     -95.6       79    -6 A     Amer…
##  3 ABE   Lehigh Valley Internatio…     40.7     -75.4      393    -5 A     Amer…
##  4 ABI   Abilene Regional Airport      32.4     -99.7     1791    -6 A     Amer…
##  5 ABL   Ambler Airport                67.1    -158.       334    -9 A     Amer…
##  6 ABQ   Albuquerque Internationa…     35.0    -107.      5355    -7 A     Amer…
##  7 ABR   Aberdeen Regional Airport     45.4     -98.4     1302    -6 A     Amer…
##  8 ABY   Southwest Georgia Region…     31.5     -84.2      197    -5 A     Amer…
##  9 ACK   Nantucket Memorial Airpo…     41.3     -70.1       47    -5 A     Amer…
## 10 ACT   Waco Regional Airport         31.6     -97.2      516    -6 A     Amer…
## # ℹ 1,241 more rows

💡 if the new name (or the old one) contains a space, quotes are necessary
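A minimal sketch (the new name “time zone” is invented for the example):

```r
library(dplyr)
library(nycflights23)
data("airports")

# The new name contains a space, so it must be quoted
rename(airports, "time zone" = tzone)
```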

arrange

to sort the dataset according to one or several columns

arrange(flights, dep_delay)
## # A tibble: 435,352 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2023     8    16     1839           1929       -50     2254           2218
##  2  2023    12     2     2147           2225       -38     2332             35
##  3  2023     9     3     1801           1834       -33     1934           2040
##  4  2023     8    13     1804           1834       -30     1947           2040
##  5  2023     1    28     1930           1959       -29     2308           2336
##  6  2023     1    18     2103           2129       -26     2301           2336
##  7  2023     5     5     1129           1155       -26     1337           1420
##  8  2023     4    16      925            950       -25     1102           1140
##  9  2023    11    14     2055           2120       -25     2215           2239
## 10  2023    12     6     2234           2259       -25     2350             30
## # ℹ 435,342 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

the desc() function reverses the order

arrange(flights, desc(dep_delay))
## # A tibble: 435,352 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2023    12    17     1953           1340      1813     2155           1543
##  2  2023    10     1     1240            659      1781     1407            835
##  3  2023     4    25     1201            659      1742     1315            818
##  4  2023     2     7     2045           1700      1665     2352           2025
##  5  2023     4    20      926            619      1627     1135            822
##  6  2023    10    29      856            600      1616     1050            805
##  7  2023     4    30     1818           1617      1561     2001           1820
##  8  2023     3    17     2027           1830      1557     2346           2139
##  9  2023     3    29      633            525      1508      820            700
## 10  2023     3    19     1154           1201      1433     1321           1335
## # ℹ 435,342 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Several columns

arrange(flights, month, dep_delay)
## # A tibble: 435,352 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2023     1    28     1930           1959       -29     2308           2336
##  2  2023     1    18     2103           2129       -26     2301           2336
##  3  2023     1    30     2105           2129       -24     2340           2350
##  4  2023     1    25     1844           1905       -21     2200           2222
##  5  2023     1    27      539            600       -21      700            720
##  6  2023     1    19      820            840       -20      928           1007
##  7  2023     1    19     1932           1952       -20     2307           2301
##  8  2023     1    25      715            735       -20      916            950
##  9  2023     1    28     1435           1455       -20     1734           1805
## 10  2023     1     9      816            835       -19     1008           1040
## # ℹ 435,342 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

mutate

to create new columns/variables

Creation of a column named air_time_hours

flights <- mutate(flights, air_time_hours = air_time / 60)

it is possible to create several columns at once

flights <- mutate(
    flights,
    distance_km = distance / 0.62137,
    speed = distance_km / air_time_hours
)
select(flights, "Air time"=air_time_hours, "Distance (km)"=distance_km, "Speed"=speed)
## # A tibble: 435,352 × 3
##    `Air time` `Distance (km)` Speed
##         <dbl>           <dbl> <dbl>
##  1       6.12           4023.  658.
##  2       1.8            1223.  680.
##  3       3.17           2536.  801.
##  4       1.8            1024.  569.
##  5       1.33            785.  589.
##  6       2.57           1746.  680.
##  7       3.2            2536.  793.
##  8       1.98           1157.  583.
##  9       4.3            2253.  524.
## 10       2.62           1714.  655.
## # ℹ 435,342 more rows

The pipe

Actions can be chained with the pipe operator |>, available in base R since version 4.1 (the magrittr pipe %>% can also be used).

flights |>
  filter(dep_delay >= 10 & dep_delay <= 15) |>
  arrange(desc(dep_delay)) |>
  select(flight, origin, dest, dep_delay, air_time_hours, distance_km, speed) |>
  head(5)
## # A tibble: 5 × 7
##   flight origin dest  dep_delay air_time_hours distance_km speed
##    <int> <chr>  <chr>     <dbl>          <dbl>       <dbl> <dbl>
## 1    786 EWR    RSW          15           2.73       1719.  629.
## 2    910 LGA    MSY          15           2.92       1904.  653.
## 3    908 EWR    MSY          15           2.88       1878.  651.
## 4    503 EWR    SFO          15           5.68       4128.  726.
## 5     27 JFK    SLC          15           4.97       3203.  645.

⚠️ the actions are chained (pipeline), but nothing is stored. To keep track of these actions, the pipeline must start with an assignment to an object.

delays <- flights |>
  filter(dep_delay >= 10 & dep_delay <= 15) |>
  arrange(desc(dep_delay)) |>
  select(flight, origin, dest, dep_delay, air_time_hours, distance_km, speed)

The result of the pipeline will be stored in the delays object

💻 Practice

  1. display rows 100 to 105 (dataset flights)
  2. display the flights of July
  3. display the flights that had a delay (arr_delay) between 5 and 15 minutes. The result must be displayed in descending order.
  4. select the columns name, lat and lon of the airports dataset airports and display lat and lon with the names latitude and longitude
  5. in the airports dataset, create a new variable alt_metres which contains the altitude in the metric system (convert feet to meters by dividing the variable by 3.28).

group_by

  • makes it possible to apply a function to grouped data
  • example: we group the data by month and display the first row of each group (combination of group_by and slice)
flights |> group_by(month) |> slice(1)
## # A tibble: 12 × 22
## # Groups:   month [12]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2023     1     1        1           2038       203      328              3
##  2  2023     2     1      453            501        -8      934            943
##  3  2023     3     1        6           2259        67      104              5
##  4  2023     4     1        1           2205       116      111           2317
##  5  2023     5     1       10           2225       105      357            227
##  6  2023     6     1       39           1659       460      325           2000
##  7  2023     7     1        2           2256        66      401            300
##  8  2023     8     1        3           2035       208      224           2330
##  9  2023     9     1       12           2236        96      120           2359
## 10  2023    10     1       58           2250       128      302            118
## 11  2023    11     1        2           2340        22       57             54
## 12  2023    12     1      506            509        -3      807            823
## # ℹ 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, air_time_hours <dbl>,
## #   distance_km <dbl>, speed <dbl>

Group the data by company, calculate the average delay and place the result in a new variable called mean_delay (combination of group_by and mutate)
mean_delay is the average delay of each company

flights |>
  group_by(carrier) |>
  mutate(mean_delay = mean(dep_delay, na.rm=T)) |>
  slice(1) |>
  select(carrier, mean_delay)
## # A tibble: 14 × 2
## # Groups:   carrier [14]
##    carrier mean_delay
##    <chr>        <dbl>
##  1 9E            7.44
##  2 AA           14.2 
##  3 AS           12.0 
##  4 B6           23.8 
##  5 DL           15.1 
##  6 F9           35.7 
##  7 G4            3.98
##  8 HA           22.9 
##  9 MQ           10.5 
## 10 NK           18.2 
## 11 OO           19.8 
## 12 UA           17.6 
## 13 WN           16.1 
## 14 YX            4.21

💡 See later summarise to get the same information.

Group the data by month and display the flight with the longest delay for each month (combination of group_by and filter)

flights |>
  group_by(month) |>
  filter(dep_delay == max(dep_delay, na.rm=T)) |>
  select(month, dep_delay, carrier, origin, dest) |>
  arrange(month)
## # A tibble: 12 × 5
## # Groups:   month [12]
##    month dep_delay carrier origin dest 
##    <int>     <dbl> <chr>   <chr>  <chr>
##  1     1      1201 AA      JFK    STT  
##  2     2      1665 AA      LGA    DFW  
##  3     3      1557 AA      EWR    DFW  
##  4     4      1742 AA      LGA    DCA  
##  5     5      1312 UA      LGA    ORD  
##  6     6      1353 AA      LGA    DFW  
##  7     7      1413 AA      LGA    DFW  
##  8     8      1161 F9      LGA    ATL  
##  9     9      1369 AA      JFK    LAX  
## 10    10      1781 AA      EWR    ORD  
## 11    11      1074 DL      JFK    ATL  
## 12    12      1813 AA      EWR    CLT

it is possible to group by several variables

flights |>
  group_by(month, carrier) |>
  filter(dep_delay == max(dep_delay, na.rm=T)) |>
  select(month, carrier, dep_delay) |>
  arrange(month, carrier)
## # A tibble: 165 × 3
## # Groups:   month, carrier [165]
##    month carrier dep_delay
##    <int> <chr>       <dbl>
##  1     1 9E            773
##  2     1 AA           1201
##  3     1 AS            266
##  4     1 B6            739
##  5     1 DL           1074
##  6     1 F9           1006
##  7     1 G4             84
##  8     1 HA            167
##  9     1 MQ            140
## 10     1 NK            761
## # ℹ 155 more rows

summarise

It calculates a summary statistic on a variable.

flights |>
  summarise(
    mean_dep_delay=mean(dep_delay, na.rm=T),
    mean_arr_delay=mean(arr_delay, na.rm=T)
  )
## # A tibble: 1 × 2
##   mean_dep_delay mean_arr_delay
##            <dbl>          <dbl>
## 1           13.8           4.34

Combined with group_by, summarise is extremely powerful.

Display the average delay of each company and arrange the result from the longest delay to the shortest

flights |>
  group_by(carrier) |>
  summarise(mean_dep_delay=mean(dep_delay, na.rm=T)) |>
  arrange(desc(mean_dep_delay))
## # A tibble: 14 × 2
##    carrier mean_dep_delay
##    <chr>            <dbl>
##  1 F9               35.7 
##  2 B6               23.8 
##  3 HA               22.9 
##  4 OO               19.8 
##  5 NK               18.2 
##  6 UA               17.6 
##  7 WN               16.1 
##  8 DL               15.1 
##  9 AA               14.2 
## 10 AS               12.0 
## 11 MQ               10.5 
## 12 9E                7.44
## 13 YX                4.21
## 14 G4                3.98

n=n()

The n() function returns the number of observations used in the calculations

flights |>
  group_by(carrier) |>
  summarise(
    mean_dep_delay=mean(dep_delay, na.rm=T),
    n=n()
  ) |>
  arrange(desc(mean_dep_delay))
## # A tibble: 14 × 3
##    carrier mean_dep_delay     n
##    <chr>            <dbl> <int>
##  1 F9               35.7   1286
##  2 B6               23.8  66169
##  3 HA               22.9    366
##  4 OO               19.8   6432
##  5 NK               18.2  15189
##  6 UA               17.6  79641
##  7 WN               16.1  12385
##  8 DL               15.1  61562
##  9 AA               14.2  40525
## 10 AS               12.0   7843
## 11 MQ               10.5    357
## 12 9E                7.44 54141
## 13 YX                4.21 88785
## 14 G4                3.98   671

With two variables in group_by

flights |> 
  group_by(month, dest) |>
  summarise(n=n()) |>
  arrange(desc(n))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
## # A tibble: 1,274 × 3
## # Groups:   month [12]
##    month dest      n
##    <int> <chr> <int>
##  1     3 BOS    2008
##  2     5 BOS    1798
##  3     4 BOS    1768
##  4     2 BOS    1745
##  5     1 BOS    1693
##  6     3 ORD    1654
##  7     3 MCO    1624
##  8     1 ORD    1603
##  9     5 ORD    1603
## 10    12 MCO    1585
## # ℹ 1,264 more rows

💡 If only the counts are needed, the count function can be used

flights |> 
  count(month, dest) |>
  arrange(desc(n))
## # A tibble: 1,274 × 3
##    month dest      n
##    <int> <chr> <int>
##  1     3 BOS    2008
##  2     5 BOS    1798
##  3     4 BOS    1768
##  4     2 BOS    1745
##  5     1 BOS    1693
##  6     3 ORD    1654
##  7     3 MCO    1624
##  8     1 ORD    1603
##  9     5 ORD    1603
## 10    12 MCO    1585
## # ℹ 1,264 more rows

Variables calculated by the summarise function can be used to create a new variable with mutate

flights |> 
  group_by(month, dest) |>
  summarise(n=n()) |>
  mutate(pourcentage=n/sum(n) * 100) |>
  arrange(desc(pourcentage)) |> 
  head()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups:   month [6]
##   month dest      n pourcentage
##   <int> <chr> <int>       <dbl>
## 1     3 BOS    2008        5.08
## 2     2 BOS    1745        5.02
## 3    12 MCO    1585        4.75
## 4     4 BOS    1768        4.72
## 5     1 BOS    1693        4.70
## 6     5 BOS    1798        4.64

💻 Practice

  1. display the number of flights by month and display the result in ascending order
  2. calculate the number of flights to Los Angeles (code LAX) for each month
  3. calculate the number of flights by month and by destination and display only the max
  4. calculate the number of flights by month and create a variable that displays the percentage it represents of the annual flights

Descriptive statistics

ℹ️ Descriptive statistics are numerical summaries that describe and summarize the features of a dataset.

  • the first step consists in examining each variable separately (univariate analysis)
  • the second step consists in exploring the relationships between pairs of variables (bivariate analysis)
  • the third step consists in examining the relationships between multiple variables simultaneously (multivariate analysis). This can involve techniques such as regression analysis, factor analysis, principal component analysis, cluster analysis etc.

We use the hdv2003 dataset, provided by the questionr package

Univariate analysis

The analysis depends on the type of the variable:

  • quantitative (age, income, etc.) -> numeric vector
  • qualitative (gender, profession, etc.) -> factor vector

Quantitative variable

  • simple statistics: minimum, maximum, mean, median, variance, standard deviation
  • the corresponding functions are min(), max(), range(), mean(), median(), var(), sd()
  • the quartiles can also be computed with the quantile function

The summary() function displays all this information directly

summary(hdv2003$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   35.00   48.00   48.16   60.00   97.00
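The individual functions can of course be called directly; for instance, a sketch computing the mean, the standard deviation and two deciles of the same variable:

```r
library(questionr)
data("hdv2003")

mean(hdv2003$age)
sd(hdv2003$age)
# First and ninth deciles
quantile(hdv2003$age, probs = c(0.1, 0.9))
```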

Qualitative variable

the table() function counts the number of observations in each category/level; the result is a frequency table

  • add the argument useNA="ifany" to count the NA values
  • The summary() function applied on a categorical variable calls this table function
table(hdv2003$qualif, useNA="ifany")
## 
##       Ouvrier specialise         Ouvrier qualifie               Technicien 
##                      203                      292                       86 
## Profession intermediaire                    Cadre                  Employe 
##                      160                      260                      594 
##                    Autre                     <NA> 
##                       58                      347

If percentages are preferred over counts, use the function prop.table with the result of the table function as argument

prop.table(table(hdv2003$qualif, useNA="ifany")) * 100
## 
##       Ouvrier specialise         Ouvrier qualifie               Technicien 
##                    10.15                    14.60                     4.30 
## Profession intermediaire                    Cadre                  Employe 
##                     8.00                    13.00                    29.70 
##                    Autre                     <NA> 
##                     2.90                    17.35

the questionr package provides the freq() function
useful arguments: valid=F to hide the column of valid percentages (percentages computed without the NA values); total=T to add a row with the total; and sort="dec" (or "inc") to display values in decreasing (increasing) order

freq(hdv2003$qualif, valid=F, total=T, sort="dec")
##                             n     %
## Employe                   594  29.7
## Ouvrier qualifie          292  14.6
## Cadre                     260  13.0
## Ouvrier specialise        203  10.2
## Profession intermediaire  160   8.0
## Technicien                 86   4.3
## Autre                      58   2.9
## NA                        347  17.3
## Total                    2000 100.0

💻 Practice

  1. analyse the variable heures.tv
  2. analyse the variable trav.imp

Bivariate analysis

  • to analyse the relationship between two variables
  • the analysis depends on the types of the variables
    • both qualitative
    • one qualitative and one quantitative
    • both quantitative

Two qualitative variables

build a contingency table with the table function, passing it two vectors

(tb_prof_genre <- table(hdv2003$qualif, hdv2003$sexe))
##                           
##                            Homme Femme
##   Ouvrier specialise          96   107
##   Ouvrier qualifie           229    63
##   Technicien                  66    20
##   Profession intermediaire    88    72
##   Cadre                      145   115
##   Employe                     96   498
##   Autre                       21    37
  • the prop.table function computes the percentages
  • use the margin argument to indicate the axis (row=1, column=2) by which the percentages must be computed
prop.table(tb_prof_genre, margin=1) * 100
##                           
##                               Homme    Femme
##   Ouvrier specialise       47.29064 52.70936
##   Ouvrier qualifie         78.42466 21.57534
##   Technicien               76.74419 23.25581
##   Profession intermediaire 55.00000 45.00000
##   Cadre                    55.76923 44.23077
##   Employe                  16.16162 83.83838
##   Autre                    36.20690 63.79310

round() to set the number of decimals

round(prop.table(tb_prof_genre, margin=1) * 100, 2)
##                           
##                            Homme Femme
##   Ouvrier specialise       47.29 52.71
##   Ouvrier qualifie         78.42 21.58
##   Technicien               76.74 23.26
##   Profession intermediaire 55.00 45.00
##   Cadre                    55.77 44.23
##   Employe                  16.16 83.84
##   Autre                    36.21 63.79

If we change the axis, the interpretation is different. Below the distribution of professions for each gender.

round(prop.table(table(hdv2003$qualif, hdv2003$sexe), margin=2) * 100, 2)
##                           
##                            Homme Femme
##   Ouvrier specialise       12.96 11.73
##   Ouvrier qualifie         30.90  6.91
##   Technicien                8.91  2.19
##   Profession intermediaire 11.88  7.89
##   Cadre                    19.57 12.61
##   Employe                  12.96 54.61
##   Autre                     2.83  4.06

questionr provides two functions, lprop (row percentages) and cprop (column percentages), with one decimal by default

lprop(tb_prof_genre)
##                           
##                            Homme Femme Total
##   Ouvrier specialise        47.3  52.7 100.0
##   Ouvrier qualifie          78.4  21.6 100.0
##   Technicien                76.7  23.3 100.0
##   Profession intermediaire  55.0  45.0 100.0
##   Cadre                     55.8  44.2 100.0
##   Employe                   16.2  83.8 100.0
##   Autre                     36.2  63.8 100.0
##   Ensemble                  44.8  55.2 100.0

some useful arguments: display the “%” sign, add the counts and change the number of decimals

lprop(tb_prof_genre, percent=T, n=T, digits=2)
##                           
##                            Homme    Femme    Total    n   
##   Ouvrier specialise         47.29%   52.71%  100.00%  203
##   Ouvrier qualifie           78.42%   21.58%  100.00%  292
##   Technicien                 76.74%   23.26%  100.00%   86
##   Profession intermediaire   55.00%   45.00%  100.00%  160
##   Cadre                      55.77%   44.23%  100.00%  260
##   Employe                    16.16%   83.84%  100.00%  594
##   Autre                      36.21%   63.79%  100.00%   58
##   Ensemble                   44.83%   55.17%  100.00% 1653
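cprop works the same way but computes the column percentages, i.e. the distribution of professions within each gender:

```r
library(questionr)
data("hdv2003")

tb_prof_genre <- table(hdv2003$qualif, hdv2003$sexe)
# Column percentages (one column per gender)
cprop(tb_prof_genre)
```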

Statistical test

  • we can test for the independence of the two variables with a \(\chi^2\) test
  • \(H_0\): the two variables (profession and gender) are independent
chisq.test(tb_prof_genre)
## 
##  Pearson's Chi-squared test
## 
## data:  tb_prof_genre
## X-squared = 387.56, df = 6, p-value < 2.2e-16
  • X-squared is the test statistic, a measure of the distance between the observed data and the theoretical distribution expected if the two variables were independent
  • df is the number of degrees of freedom
  • the p-value is the probability of observing a test statistic at least this extreme if \(H_0\) were true. Here the p-value is lower than 5%, the standard threshold, so we reject \(H_0\) and conclude that the two variables are not independent.

One qualitative and one quantitative variable

Is the distribution of the numerical (quantitative) variable significantly different across the various levels of the categorical (qualitative) variable?

The mean of the quantitative variable for each category of the qualitative variable

hdv2003 |> 
  group_by(sport) |>
  summarise(mean(age))
## # A tibble: 2 × 2
##   sport `mean(age)`
##   <fct>       <dbl>
## 1 Non          52.3
## 2 Oui          40.9

Statistical test

Test whether the average age is different depending on whether the individual practices sport or not.

If the quantitative variable follows a normal distribution, use the parametric Student t-test; otherwise use the non-parametric Mann-Whitney test.
If the qualitative variable has more than 2 levels, use an ANOVA (parametric, aov()) or a Kruskal-Wallis test (non-parametric, kruskal.test())
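As a sketch of the more-than-two-levels case, using the trav.imp variable (importance given to work, four levels) from the same dataset:

```r
library(questionr)
data("hdv2003")

# Parametric: one-way ANOVA
summary(aov(age ~ trav.imp, data = hdv2003))

# Non-parametric: Kruskal-Wallis test
kruskal.test(age ~ trav.imp, data = hdv2003)
```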

To test the normality of the distribution, we can use the Shapiro-Wilk test. \(H_0\): the vector follows a normal distribution

shapiro.test(hdv2003$age)
## 
##  Shapiro-Wilk normality test
## 
## data:  hdv2003$age
## W = 0.98081, p-value = 9.351e-16

the p-value is lower than 5%, so we reject \(H_0\): the non-parametric Mann-Whitney test must be used

Mann-Whitney test: wilcox.test with the argument paired=F, because the samples are drawn from different populations (athletes and non-athletes are not the same individuals)

wilcox.test(age ~ sport, data=hdv2003, paired=F)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  age by sport
## W = 640577, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

Student test (\(H_0\): the average age is not significantly different between athletes and non-athletes)

t.test(age ~ sport, data=hdv2003)
## 
##  Welch Two Sample t-test
## 
## data:  age by sport
## t = 15.503, df = 1600.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Non and group Oui is not equal to 0
## 95 percent confidence interval:
##   9.893117 12.759002
## sample estimates:
## mean in group Non mean in group Oui 
##          52.25137          40.92531

In both cases the p-value is lower than 5% which means that the age of athletes is significantly different (here lower) than the age of non-athletes.

Two quantitative variables

Load a new dataset, provided by questionr

data("rp2018")

These are data on French municipalities with more than 2000 inhabitants; run ?rp2018 for a description

names(rp2018)
##  [1] "code_insee"       "commune"          "code_region"      "region"          
##  [5] "code_departement" "departement"      "pop_tot"          "pop_cl"          
##  [9] "pop_0_14"         "pop_15_29"        "pop_18_24"        "pop_75p"         
## [13] "pop_femmes"       "pop_act_15p"      "pop_chom"         "pop_agric"       
## [17] "pop_indep"        "pop_cadres"       "pop_interm"       "pop_empl"        
## [21] "pop_ouvr"         "pop_scol_18_24"   "pop_non_scol_15p" "pop_dipl_aucun"  
## [25] "pop_dipl_bepc"    "pop_dipl_capbep"  "pop_dipl_bac"     "pop_dipl_sup2"   
## [29] "pop_dipl_sup34"   "pop_dipl_sup"     "log_rp"           "log_proprio"     
## [33] "log_loc"          "log_hlm"          "log_sec"          "log_maison"      
## [37] "log_appart"       "age_0_14"         "age_15_29"        "age_75p"         
## [41] "femmes"           "chom"             "agric"            "indep"           
## [45] "cadres"           "interm"           "empl"             "ouvr"            
## [49] "etud"             "dipl_aucun"       "dipl_bepc"        "dipl_capbep"     
## [53] "dipl_bac"         "dipl_sup2"        "dipl_sup34"       "dipl_sup"        
## [57] "resid_sec"        "proprio"          "locataire"        "hlm"             
## [61] "maison"           "appart"

Correlation coefficient

The coefficient varies between -1 and +1; it measures the strength and the direction of the relationship between two quantitative variables. A coefficient close to -1 indicates a strong negative relationship (when one variable increases the other decreases), and a coefficient close to +1 a strong positive relationship (both variables move in the same direction).

cor(rp2018$cadres, rp2018$dipl_sup)
## [1] 0.9291504

The statistical test can be called with the cor.test() function

\(H_0\) : the coefficient is equal to 0

cor.test(rp2018$cadres, rp2018$dipl_sup)
## 
##  Pearson's product-moment correlation
## 
## data:  rp2018$cadres and rp2018$dipl_sup
## t = 184.94, df = 5415, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9254182 0.9327024
## sample estimates:
##       cor 
## 0.9291504

we reject \(H_0\), the correlation between the two variables is significantly different from zero

💡 If the relation is not linear the Spearman correlation test (based on ranks) is more suitable : cor.test(rp2018$cadres, rp2018$dipl_sup, method="spearman")

💻 Practice

With the dataset rp2018

  1. Analyse the relationship between region and pop_chom :
  • calculate the average number of unemployed people by departement
  • test whether the average number of unemployed people differs between the regions
  2. Analyse the relationship between pop_tot and pop_chom

Data Visualization

With ggplot2 package

We use a subset of the dataset rp2018

rp <- rp2018 |> filter(departement %in% c("Oise", "Rhône", "Hauts-de-Seine", "Lozère", "Bouches-du-Rhône"))
head(rp)
## # A tibble: 6 × 62
##   code_insee commune     code_region region code_departement departement pop_tot
##   <chr>      <chr>       <chr>       <chr>  <chr>            <chr>         <dbl>
## 1 13001      Aix-en-Pro… 93          Prove… 13               Bouches-du…  143097
## 2 13002      Allauch     93          Prove… 13               Bouches-du…   20860
## 3 13003      Alleins     93          Prove… 13               Bouches-du…    2516
## 4 13004      Arles       93          Prove… 13               Bouches-du…   51031
## 5 13005      Aubagne     93          Prove… 13               Bouches-du…   47208
## 6 13007      Auriol      93          Prove… 13               Bouches-du…   12771
## # ℹ 55 more variables: pop_cl <fct>, pop_0_14 <dbl>, pop_15_29 <dbl>,
## #   pop_18_24 <dbl>, pop_75p <dbl>, pop_femmes <dbl>, pop_act_15p <dbl>,
## #   pop_chom <dbl>, pop_agric <dbl>, pop_indep <dbl>, pop_cadres <dbl>,
## #   pop_interm <dbl>, pop_empl <dbl>, pop_ouvr <dbl>, pop_scol_18_24 <dbl>,
## #   pop_non_scol_15p <dbl>, pop_dipl_aucun <dbl>, pop_dipl_bepc <dbl>,
## #   pop_dipl_capbep <dbl>, pop_dipl_bac <dbl>, pop_dipl_sup2 <dbl>,
## #   pop_dipl_sup34 <dbl>, pop_dipl_sup <dbl>, log_rp <dbl>, …

Basic principles of ggplot2

  • Initialize the graphic function with ggplot()
  • Add graphical elements using geom_ (histogram, scatter plot, boxplot, bars, curves, …)
  • Customize the graphics using arguments in geom_ functions and aes()
  • The aes() function maps dataset variables to graphical elements

One variable

Histogram

geom_histogram to display the distribution of a quantitative variable

ggplot(data=rp) +
  geom_histogram(aes(x=cadres))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3 options to control the binning of data:

  • bins = X: Specify the number of bins directly. X represents the desired number of bins. This method allows ggplot2 to automatically calculate the bin width based on the range of the data and the specified number of bins.
  • binwidth = X: Set the width of each bin. X represents the desired width of each bin. This gives you direct control over the size of the bins but indirectly determines the number of bins based on the data range.
  • breaks = vector: Define the bin edges explicitly. In this case, you provide a numeric vector that specifies the boundaries of the bins. This method offers the highest level of control, allowing you to specify irregular bin widths if needed.
ggplot(data=rp) +
  geom_histogram(aes(x=cadres), breaks=seq(0, 60, 10))

Density plot

geom_density: to display the density of a quantitative variable

ggplot(data=rp) +
  geom_density(aes(x=cadres))

Both the histogram and the density plot can be displayed on the same graph; use aes(y=after_stat(density)) to put the density on the y-axis

ggplot(data=rp, aes(x=cadres)) +
  geom_histogram(aes(y=after_stat(density)), breaks=seq(0, 60, 10)) +
  geom_density()

Boxplot

geom_boxplot: another graph that can be used to visualize a quantitative variable.

  • the box represents the first quartile (25% of the observations are below), the median (50%) and the third quartile (75%)
  • the main rectangle therefore represents 50% of the observations (25% are below and 25% are above)
  • the whiskers extend up to 1.5 times the interquartile range.
    If one or several values lie outside these whiskers they are displayed as points. These values are sometimes considered outliers (additional checks must be made before removing them from the analysis)
ggplot(data=rp) +
  geom_boxplot(aes(y=cadres)) 

One boxplot for each level of a qualitative variable

ggplot(data=rp) +
  geom_boxplot(aes(x=departement, y=cadres))

varwidth=T to have the width of the boxes proportional to the number of observations in each group

ggplot(data=rp) +
  geom_boxplot(aes(x=departement, y=cadres), varwidth=T)

Barplot

geom_bar: to display the counts of each category of a qualitative variable

ggplot(rp) +
  geom_bar(aes(x=departement))

To display the proportions, we need to use after_stat(prop) with group=1 to calculate the proportion for the whole dataset

ggplot(rp) +
  geom_bar(aes(x=departement, y=after_stat(prop), group=1))

Another solution is to use geom_col to display calculated values

rp |> 
  group_by(departement) |> 
  summarise(n=n()) |> 
  mutate(percent=n/sum(n)*100) |>
  ggplot() +
  geom_col(aes(x=departement, y=percent))

The slight difference between geom_bar(after_stat) and geom_col with summarise lies in when proportions are computed:

  1. geom_bar(): automatically counts occurrences and applies after_stat(prop) at the plotting stage. Proportions are computed after binning, which may introduce slight rounding differences.

  2. summarise() + geom_col(): counts (n = n()) and proportions (n / sum(n) * 100) are pre-computed before plotting. This guarantees a direct control over displayed percentages.

💡 Use geom_bar() for a quick approach without preprocessing and geom_col() for full control over calculations and additional statistics.

Pie

geom_col with polar coordinates
ggplot2 does not provide a geom_pie function, but it is possible to create a pie chart with geom_col and polar coordinates

rp |> 
  group_by(departement) |> 
  summarise(n=n()) |> 
  mutate(percent=n/sum(n)*100) |>
  ggplot() +
  geom_col(aes(x="", y=percent, fill=departement)) + 
  coord_polar("y", start=0)

Two variables

Scatter plot

geom_point

ggplot(data=rp2018) +
  geom_point(aes(x=cadres, y=dipl_sup))

to add the regression line in the graph, use the geom_smooth function

ggplot(data=rp2018, aes(x=cadres, y=dipl_sup)) +
  geom_point() +
  geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

💡 Since the same aes() mapping is used by both geoms, we can specify it once in the ggplot() call

Line plot

geom_line

Load dataset ‘economics’ from the package ‘ggplot2’

data(economics, package="ggplot2")
ggplot(data=economics) +
  geom_line(aes(x=date, y=unemploy))

Three and four variables

Mapping

  • associate an attribute of the graph (color, point size, marker) to the value of a categorical variable
  • the mapping is done in the aes function
ggplot(rp) +
  geom_point(aes(x = dipl_sup, y = cadres, color = departement))

A fourth dimension through the size of the points which represents the total population of the commune

ggplot(rp) +
  geom_point(aes(x=dipl_sup, y=cadres, color=departement, size=pop_tot))

Possible to customize a mapping with scale_

ggplot(rp) +
  geom_point(aes(x=dipl_sup, y=cadres, color=departement, size=pop_tot)) +
  scale_size("Population", breaks=c(0, 5000, 10000, 50000, 100000), range=c(0, 10))

Faceting

Repeat a graph several times according to one or several variables

ggplot(rp) +
  geom_histogram(aes(x=cadres)) +
  facet_wrap(vars(departement))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(rp) +
  geom_point(aes(x=dipl_sup, y=cadres, color=pop_tot)) + 
  facet_wrap(~departement)

Customization

Titles and labels

labs function to add titles and labels

ggplot(rp) +
  geom_point(aes(x=dipl_sup, y=cadres, color=pop_tot)) + 
  labs(title="Relation between Executives and Graduates", x="Graduates", y="Executives")

Axes

scale_x_continuous, scale_y_continuous to customize the axes

ggplot(rp) +
  geom_point(aes(x=dipl_sup, y=cadres, color=pop_tot)) + 
  scale_x_continuous("Graduates", breaks=seq(0, 60, 10), limits=c(0, 60)) +
  scale_y_continuous("Executives", breaks=seq(0, 60, 10), limits=c(0, 60))

scale_x_discrete, scale_y_discrete for qualitative variables

ggplot(rp) +
  geom_bar(aes(x=departement)) + 
  scale_x_discrete("Departement") +
  scale_y_continuous("Number of observations")

Legend title

With labs

ggplot(rp) +
  geom_point(aes(x=dipl_sup, y=cadres, color=pop_tot)) + 
  labs(color="Population")

Practice

With the dataset rp2018

  • a geom_point with x=dipl_aucun and y=ouvr, with axis labels and blue points
  • the same graph as above with a color mapping on region
  • boxplot: distribution of the variable proprio depending on the size of the department (pop_cl)
  • geom_point with x=dipl_sup and y=cadres, with one graph per region (facet_wrap)
  • boxplot: distribution of the average proportion of owners by department within each region

Multivariate analysis

Linear regression

Linear regression is a technique used for modelling the relationship between a dependent variable and one or more independent variables.

  • performed using the lm(formula, data) function, where formula specifies the relationship between the variables, and data is the dataset
  • the pattern: y ~ x1 + x2 + ... + xn, where y is the dependent variable and x1, x2, ..., xn are the independent variables
  • R calculates the regression coefficients (slopes) and intercept to estimate the best-fit line that minimizes the sum of squared residuals
  • The summary() function provides details about the linear regression model, including the coefficient estimates, standard errors, t-values, and p-values.
    • Estimate: coefficients of the independent variables
    • Std. Error: used to compute the confidence interval and the t-statistic
    • t value: the coefficient divided by the standard error
    • Pr(>|t|): the p-value
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(2, 3, 2, 5, 4)
y <- c(2, 3, 5, 7, 11)
lm_mod <- lm(y ~ x1 + x2)
summary(lm_mod)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##    1    2    3    4    5 
##  0.8 -0.3 -0.9 -0.5  0.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -0.6500     1.4921  -0.436   0.7056  
## x1            2.3500     0.5256   4.471   0.0466 *
## x2           -0.2500     0.6374  -0.392   0.7327  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.14 on 2 degrees of freedom
## Multiple R-squared:  0.9492, Adjusted R-squared:  0.8984 
## F-statistic: 18.69 on 2 and 2 DF,  p-value: 0.05078

Logistic regression

  • Logistic regression is a statistical model used for modelling the relationship between a binary dependent variable and one or more independent variables
  • “logistic” because it uses the logistic function to model a binary dependent variable
  • the output of logistic regression is a probability that the given input point belongs to a certain class
  • performed using the glm() function, which stands for “generalized linear model”
  • syntax is glm(formula, data, family), where family specifies the error distribution and link function to use. For logistic regression, we use family = binomial
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
glm_mod <- glm(y ~ x, family=binomial)
summary(glm_mod)
## 
## Call:
## glm(formula = y ~ x, family = binomial)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -7.159      4.759  -1.504    0.133
## x              1.302      0.840   1.550    0.121
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13.863  on 9  degrees of freedom
## Residual deviance:  5.018  on 8  degrees of freedom
## AIC: 9.018
## 
## Number of Fisher Scoring iterations: 6

ℹ️ The exponentiated coefficients are odds ratios: for a one-unit increase in the independent variable, the odds of the outcome occurring are multiplied by \(e^\beta\), where \(\beta\) is the coefficient of the independent variable. If \(\beta = 1.3\), then \(e^\beta \approx 3.67\). This means that for each additional unit of \(x\), the odds of \(y=1\) are 3.67 times higher.

To convert odds to a probability: Probability = Odds / (1 + Odds). Odds of 3.67 correspond to a probability of 3.67 / (1 + 3.67) = 0.786, or 78.6%. Note that this is the probability implied by those odds, not an increase in probability: in a logistic model the effect of \(x\) on the probability itself is not constant across values of \(x\).
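Continuing the glm_mod example above, these quantities can be computed directly in R (a sketch, not part of the original output):

```r
exp(coef(glm_mod))  # odds ratios for the intercept and x
# predicted probability of y = 1 for a given value of x
predict(glm_mod, newdata = data.frame(x = 5), type = "response")
```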

Multinomial logistic regression

  • Multinomial logistic regression is a technique used for modelling the relationship between a categorical dependent variable with more than two levels and one or more independent variables
  • estimates \(k-1\) models, where \(k\) is the number of levels of the dependent variable
  • each of these models compares one of the \(k-1\) categories to a reference category
  • performed using the multinom() function from the nnet package
  • multinom(formula, data)

The coefficients in the output can be interpreted similarly to binary logistic regression, but each coefficient now corresponds to the log of the odds of the outcome being in the corresponding category versus the reference category.

ℹ️ Exponentiating the coefficient transforms it from log-odds back into plain odds. This value tells you how the odds of being in the specified category (compared to the reference category) change with a one-unit increase in the independent variable. A value above 1 indicates an increase in odds, a value below 1 indicates a decrease, and a value of 1 means no change. Example : Suppose you have a multinomial logistic regression model predicting the preferred type of pet (cats, dogs, or birds, with “birds” as the reference category) based on an independent variable like age. If the coefficient for “cats” is 0.5, then the log-odds of preferring cats over birds increase by 0.5 for each additional year of age. Exponentiating 0.5 gives you the odds ratio (about 1.65), indicating that with each additional year of age, the odds of preferring cats over birds increase by 65%.

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
multi_mod <- multinom(y ~ x)
## # weights:  9 (4 variable)
## initial  value 10.986123 
## iter  10 value 0.934296
## iter  20 value 0.544624
## iter  30 value 0.329800
## iter  40 value 0.222875
## iter  50 value 0.142208
## iter  60 value 0.104801
## iter  70 value 0.066253
## iter  80 value 0.048903
## iter  90 value 0.046977
## iter 100 value 0.045428
## final  value 0.045428 
## stopped after 100 iterations
summary(multi_mod)
## Call:
## multinom(formula = y ~ x)
## 
## Coefficients:
##   (Intercept)        x
## 2   -36.19612 10.27190
## 3   -89.31736 18.44352
## 
## Std. Errors:
##   (Intercept)        x
## 2    68.70837 18.81918
## 3    99.69833 21.83700
## 
## Residual Deviance: 0.0908552 
## AIC: 8.090855
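As with binary logistic regression, exponentiating the coefficients of multi_mod gives odds ratios relative to the reference category (a sketch continuing the example above):

```r
exp(coef(multi_mod))  # odds ratios of category 2 vs 1 and category 3 vs 1
# predicted category probabilities for each observation
head(predict(multi_mod, type = "probs"))
```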

Panel Data Analysis

  • Panel data, also known as longitudinal or cross-sectional time-series data, are data that are collected on multiple entities (such as individuals, firms, countries) over multiple time periods.
  • Need to install and load the plm package
  • syntax is plm(formula, data, index, model), where index is a vector giving the names or indices of the individual and the time indexes, and model specifies the type of model: within, random or pooling
  • Within (Fixed Effects) Model: This model assumes that something within each individual (or group) may impact or bias the independent variable and needs to be controlled for. In other words, individual-specific effects are allowed to be correlated with the other regressors. The within transformation eliminates the individual effect by demeaning the data. The fixed effects model controls for the effect of those time-invariant features of the individuals in the panel in order to allow for a cleaner estimation of the predictors’ effect on the outcome variable.
  • Random Effects Model: This model assumes that the individual-specific effects are uncorrelated with the other regressors. In this model, variation across entities is assumed to be random and uncorrelated with the predictor or independent variables included in the model. If the random effects assumption holds, the random effects estimator is more efficient than the fixed effects estimator.
  • Pooling (Pooled OLS) Model: This model does not control for time-invariant individual-specific heterogeneity, treating the data as a simple cross-section. It essentially pools the data and does not consider the individual effects or group effects in the error term. It might be applicable in situations where there’s no reason to assume any fixed or random effects, or for a first pass “naive” regression analysis before moving on to more complex models.
data("Produc", package = "plm")
model <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
             data = Produc, index = c("state","year"), model = "within")
summary(model)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp, 
##     data = Produc, model = "within", index = c("state", "year"))
## 
## Balanced Panel: n = 48, T = 17, N = 816
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -0.120456 -0.023741 -0.002041  0.018144  0.174718 
## 
## Coefficients:
##              Estimate  Std. Error t-value  Pr(>|t|)    
## log(pcap) -0.02614965  0.02900158 -0.9017    0.3675    
## log(pc)    0.29200693  0.02511967 11.6246 < 2.2e-16 ***
## log(emp)   0.76815947  0.03009174 25.5273 < 2.2e-16 ***
## unemp     -0.00529774  0.00098873 -5.3582 1.114e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    18.941
## Residual Sum of Squares: 1.1112
## R-Squared:      0.94134
## Adj. R-Squared: 0.93742
## F-statistic: 3064.81 on 4 and 764 DF, p-value: < 2.22e-16

Note: ?Produc to get information about the dataset
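To choose between the within (fixed effects) and random effects specifications, the Hausman test is commonly used (phtest() from plm; \(H_0\): the random effects model is consistent). A sketch reusing the Produc example:

```r
fixed  <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
              data = Produc, index = c("state", "year"), model = "within")
random <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
              data = Produc, index = c("state", "year"), model = "random")
phtest(fixed, random)  # small p-value -> prefer the fixed effects model
```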

Multilevel Modeling

  • Multilevel models are a type of regression model for data that are grouped, or clustered in more than one category, such as students within classrooms within schools.
  • Need package lme4, and once loaded, the lmer function, which is used to fit linear and generalized linear mixed-effects models
  • syntax is lmer(formula, data), where formula describes the model to be fitted and data specifies the data frame containing the variables in the model
  • lmer(response ~ predictors + (1|group), data), where the (1|group) term specifies a random intercept for the variable group

The coefficients for the fixed effects can be interpreted in the usual way: for a one unit increase in the independent variable, the dependent variable will increase by the estimated coefficient, ceteris paribus. The random effects variance components indicate the degree of variability across the levels of the random effects variable(s).

data("sleepstudy", package = "lme4")
mlev_mod <- lmer(Reaction ~ Days + (1|Subject), data = sleepstudy)
summary(mlev_mod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + (1 | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1786.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2257 -0.5529  0.0109  0.5188  4.2506 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 1378.2   37.12   
##  Residual              960.5   30.99   
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 251.4051     9.7467   25.79
## Days         10.4673     0.8042   13.02
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.371

Another example, with two random effects: s and d

Dataset InstEval from the lme4 package

data("InstEval", package = "lme4")
model_extended <- lmer(y ~ service + (1 | s) + (1 | d), data = InstEval)
  • Fixed effect: service
  • Random effects:
    • (1 | s): Student (s) random intercept -> Captures student-specific rating biases.
    • (1 | d): Lecturer (d) random intercept -> Captures lecturer-specific rating tendencies.
summary(model_extended)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ service + (1 | s) + (1 | d)
##    Data: InstEval
## 
## REML criterion at convergence: 237743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0564 -0.7480  0.0406  0.7719  3.1969 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  s        (Intercept) 0.1057   0.325   
##  d        (Intercept) 0.2715   0.521   
##  Residual             1.3866   1.178   
## Number of obs: 73421, groups:  s, 2972; d, 1128
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  3.28328    0.01881 174.511
## service1    -0.09113    0.01327  -6.867
## 
## Correlation of Fixed Effects:
##          (Intr)
## service1 -0.226
  • Intercept (3.283) is the average rating when service = 0 (non-service course).
  • Effect of service1 (-0.09113):
    • If a course is a service course, the average rating drops by 0.091.
    • t-value = -6.867 (very large in absolute value), indicating this effect is statistically significant.

Interpretation:

  • Students rate service courses lower than regular courses.
  • The decrease (-0.091 points) is statistically significant but relatively small in magnitude.

Analysis of variance

  • Students (s): Variance = 0.1057 (Std.Dev = 0.325) -> Some students systematically rate higher/lower.
  • Lecturers (d): Variance = 0.2715 (Std.Dev = 0.521) -> Some lecturers consistently receive higher/lower ratings.
  • Residual variance (1.3866) -> Most variation remains unexplained by students or lecturers.

Dimensionality Reduction and Cluster Analysis

Dimensionality Reduction

  • Dimensionality reduction is a technique used to reduce the number of variables in a dataset while retaining as much of the original information as possible
  • It is useful for simplifying complex datasets, visualizing high-dimensional data, and improving the performance of machine learning algorithms
  • Two common techniques for dimensionality reduction are Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA)
  • PCA is used for continuous variables, while MCA is used for categorical variables

Principal Components Analysis (PCA)

Steps:

  1. Standardize the dataset (mean = 0, variance = 1).
  2. Compute the covariance matrix (identifies relationships between features).
  3. Compute eigenvalues & eigenvectors (find new principal components).
  4. Project data onto new principal components (reduce dimensions).

Example: PCA on the Iris dataset.

data("iris")

Can be performed with the stats package (already installed) or with FactoMineR and factoextra

  • FactoMineR is used to perform the analysis
  • factoextra is used to extract information for the analysis and provides visualizations (fviz_)

with stats

iris_num <- iris[, -5]  # Remove the species column
pca_result <- prcomp(iris_num, scale = TRUE, center = TRUE)
  • center = TRUE -> Mean-Centering the Data
    Each variable’s mean is subtracted from its values, so the new mean becomes 0. This ensures that PCA captures variation relative to the mean rather than absolute values. If data is not centered, the first principal component might be biased towards large mean values.
  • scale = TRUE -> Standardizing the Data
    Each variable is divided by its standard deviation, so the new standard deviation becomes 1. This ensures that PCA captures the relative importance of variables based on their variance. If data is not scaled, variables with larger variances might dominate the principal components.
summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

PC1 + PC2 explain 95.8% of the total variance in the data.
PC3 adds about 3.7% more variance, and PC4 contributes almost nothing.
Keeping 2 PCs is therefore sufficient.

The scree plot is a visual representation of the variance explained by each principal component. It helps to decide how many components to keep.
The elbow method is often used to identify the number of components that explain most of the variance.
The point where the curve starts to flatten is a good indicator of the number of components to keep.

screeplot(pca_result, type = "lines", main = "Scree Plot")

pca_data <- data.frame(pca_result$x, Species = iris$Species)
head(pca_data)
##         PC1        PC2         PC3          PC4 Species
## 1 -2.257141 -0.4784238  0.12727962  0.024087508  setosa
## 2 -2.074013  0.6718827  0.23382552  0.102662845  setosa
## 3 -2.356335  0.3407664 -0.04405390  0.028282305  setosa
## 4 -2.291707  0.5953999 -0.09098530 -0.065735340  setosa
## 5 -2.381863 -0.6446757 -0.01568565 -0.035802870  setosa
## 6 -2.068701 -1.4842053 -0.02687825  0.006586116  setosa

pca_result$x contains the coordinates of the original observations projected onto the principal components (PCs).
Each row represents an observation (data point), and each column represents a Principal Component (PC1, PC2, etc.).

Scatter plot of the first two principal components, colored by species.

ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) +
  geom_point(size = 3) +
  labs(title = "PCA Projection of Iris Dataset")

Same analysis with FactoMineR and factoextra

pca_result_2 <- PCA(iris_num, graph=F)
summary(pca_result_2)
## 
## Call:
## PCA(X = iris_num, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4
## Variance               2.918   0.914   0.147   0.021
## % of var.             72.962  22.851   3.669   0.518
## Cumulative % of var.  72.962  95.813  99.482 100.000
## 
## Individuals (the 10 first)
##                  Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1            |  2.319 | -2.265  1.172  0.954 |  0.480  0.168  0.043 | -0.128
## 2            |  2.202 | -2.081  0.989  0.893 | -0.674  0.331  0.094 | -0.235
## 3            |  2.389 | -2.364  1.277  0.979 | -0.342  0.085  0.020 |  0.044
## 4            |  2.378 | -2.299  1.208  0.935 | -0.597  0.260  0.063 |  0.091
## 5            |  2.476 | -2.390  1.305  0.932 |  0.647  0.305  0.068 |  0.016
## 6            |  2.555 | -2.076  0.984  0.660 |  1.489  1.617  0.340 |  0.027
## 7            |  2.468 | -2.444  1.364  0.981 |  0.048  0.002  0.000 |  0.335
## 8            |  2.246 | -2.233  1.139  0.988 |  0.223  0.036  0.010 | -0.089
## 9            |  2.592 | -2.335  1.245  0.812 | -1.115  0.907  0.185 |  0.145
## 10           |  2.249 | -2.184  1.090  0.943 | -0.469  0.160  0.043 | -0.254
##                 ctr   cos2  
## 1             0.074  0.003 |
## 2             0.250  0.011 |
## 3             0.009  0.000 |
## 4             0.038  0.001 |
## 5             0.001  0.000 |
## 6             0.003  0.000 |
## 7             0.511  0.018 |
## 8             0.036  0.002 |
## 9             0.096  0.003 |
## 10            0.293  0.013 |
## 
## Variables
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## Sepal.Length |  0.890 27.151  0.792 |  0.361 14.244  0.130 | -0.276 51.778
## Sepal.Width  | -0.460  7.255  0.212 |  0.883 85.247  0.779 |  0.094  5.972
## Petal.Length |  0.992 33.688  0.983 |  0.023  0.060  0.001 |  0.054  2.020
## Petal.Width  |  0.965 31.906  0.931 |  0.064  0.448  0.004 |  0.243 40.230
##                cos2  
## Sepal.Length  0.076 |
## Sepal.Width   0.009 |
## Petal.Length  0.003 |
## Petal.Width   0.059 |

screeplot

fviz_screeplot(pca_result_2)

Biplot: displays both the observations and the variables on the same plot.
The observations are represented by points, and the variables are represented by arrows.

fviz_pca_biplot(pca_result_2, label="var", habillage=as.factor(iris$Species), addEllipses = T, repel = TRUE)

K-Means Clustering

K-means clustering is a technique used for partitioning a dataset into K clusters based on their similarity. It is an unsupervised learning method that can be used for exploratory data analysis.

Clusters are built around centroids, which are the mean of the data points in each cluster. The goal is to minimize the sum of squared distances between each data point and the centroid of its cluster, and to maximize the distance between centroids.

set.seed(123)  # for reproducibility

We use the iris_num dataset (the numeric columns, without the species column).
As with dimensionality reduction, standardizing the data is important for k-means clustering; otherwise variables with larger scales might dominate the clustering process. Here the four iris measurements are on comparable scales, so we cluster the raw values.

kmeans_result <- kmeans(iris_num, centers = 3, nstart = 25)
kmeans_result
## K-means clustering with 3 clusters of sizes 50, 62, 38
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.006000    3.428000     1.462000    0.246000
## 2     5.901613    2.748387     4.393548    1.433871
## 3     6.850000    3.073684     5.742105    2.071053
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 15.15100 39.82097 23.87947
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

💡 Notes:
Here we set 3 clusters because we know that the Iris dataset has 3 species.
But in practice, the number of clusters is often unknown and must be determined using methods like the elbow method, silhouette method, or gap statistic.
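For example, the factoextra package (already used above for the PCA plots) provides `fviz_nbclust()` to compare candidate values of k with these methods:

```r
library(factoextra)

# Elbow method: within-cluster sum of squares for k = 1..10
fviz_nbclust(iris_num, kmeans, method = "wss", k.max = 10)

# Average silhouette width: the peak suggests the best k
fviz_nbclust(iris_num, kmeans, method = "silhouette", k.max = 10)

# Gap statistic (bootstrap-based, slower)
fviz_nbclust(iris_num, kmeans, method = "gap_stat", nboot = 50)
```

Each call returns a ggplot object that can be customized like any other plot.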

We add the cluster assignments to the original dataset and cross-tabulate them with the species:

iris$cluster <- as.factor(kmeans_result$cluster)
table(iris$Species, iris$cluster)
##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 48  2
##   virginica   0 14 36

Comparison between the clusters and the actual species shows that the clustering algorithm has done a good job: setosa is perfectly separated, while versicolor and virginica overlap slightly (16 of their 100 observations end up in the other cluster).
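The agreement between clusters and species can also be quantified with the adjusted Rand index, here via the mclust package (an assumption, as mclust is not used elsewhere in this document); a value of 1 means perfect agreement, while values near 0 indicate random labeling.

```r
# Adjusted Rand Index between the true species and the k-means clusters
# (requires the mclust package: install.packages("mclust"))
library(mclust)
adjustedRandIndex(iris$Species, kmeans_result$cluster)
```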

Visualizing the Clusters

We can visualize the clusters by plotting the first two principal components and coloring the points by cluster.
As we saw with PCA, the first two components explain most of the variance in the data.

Remember: pca_data <- data.frame(pca_result$x, Species = iris$Species), where pca_result$x contains the coordinates of the original observations projected onto the principal components (PCs).

ggplot(pca_data, aes(x = PC1, y = PC2, color = as.factor(kmeans_result$cluster))) +
  geom_point(size = 3) +
  labs(title = "K-Means Clustering on PCA Projection", color = "Cluster")

Choose the optimal number of clusters with the Elbow Method

The Elbow Method is a heuristic technique used to determine the optimal number of clusters in a dataset. It involves plotting the within-cluster sum of squares (WCSS) against the number of clusters, and looking for the “elbow” point where the rate of decrease slows down.

We test different numbers of clusters (from 1 to 10) and calculate the WCSS for each.

wcss <- numeric(10)
for (k in 1:10) {
  # use a temporary variable so the 3-cluster kmeans_result above is not overwritten
  km <- kmeans(iris_num, centers = k, nstart = 25)
  wcss[k] <- km$tot.withinss
}

Then we plot the WCSS against the number of clusters.

ggplot(data.frame(k = 1:10, wcss), aes(x = k, y = wcss)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Method for Optimal Number of Clusters", x = "Number of Clusters", y = "Within-Cluster Sum of Squares")

The “elbow” point is where the rate of decrease slows down. In this case, it’s at k = 3, which confirms that 3 clusters is a good choice for the Iris dataset.

Sources